ABSTRACT

A study examined the effect of verbal reporting of 
students' thinking on their performance during an examination.
Subjects, 343 high school seniors, were randomly divided into 4 
experimental groups, and a different procedure for eliciting 
students' thinking during a critical thinking test was used for each 
group. A control group took the same test in paper-and-pencil format.
Results indicated that there were no significant differences in 
either test performance or quality of thinking among the five groups. 
The results indicated that verbal reports of thinking do not 
influence students' thinking and performance during exams, making 
them a potentially useful source of validation information. (Five 
tables of data are included, and 38 references are attached.) 
(Author/RS) 
Abstract

Verbal reports of examinees' thinking on test items can provide useful validation data only if the verbal 
reporting does not change the course of examinees' thinking and performance. Using a completely 
randomized factorial design, 343 senior high school students were divided into five groups. In four of 
the groups, different procedures were used to elicit students' thinking as they worked through Part A of 
a critical thinking test of observation appraisal (Norris & King, 1983). In the control group, students
took the same test in paper-and-pencil format. There were no significant differences in test
performance among the five groups, nor in the quality of thinking among the four elicitation groups.
These results are evidence that verbal reports of thinking meet one of the necessary conditions of 
useful validation data, namely, that collecting the data not influence examinees' thinking and 
performance. Since verbal reports of thinking can also contain a wealth of information on the 
psychological processes that underlie performance, they are a potentially important source of validation 
information. 
VERBAL REPORTS OF THINKING AS DATA
FOR VALIDATING MULTIPLE-CHOICE TESTS 



Verbal reports of examinees' thinking are often recommended as relevant and important sources of 
evidence for validating tests (Anastasi, 1988; Cronbach, 1971; Ennis & Norris, in press; Haney & Scott, 
1987; Messick, in press; Norris, in press-b, in press-c). Sometimes the proposed relevance is indirect,
as when verbal reports of thinking are used to develop information processing models of test
performance which, in turn, are directly relevant to assessing construct validity (Embretson, 1983;
Embretson, Schneider, & Roth, 1986). Verbal reports of thinking have been used in test validation 
(Bloom & Broder, 1950; Connolly & Wantman, 1964; Haney & Scott, 1987; Kropp, 1956; McGuire, 
1963; Schuman, 1966) but, possibly because of past emphasis on behavioristic approaches, not 
extensively. With the growing emphasis on cognitive approaches, it is likely they will receive greater 
attention (Afflerbach & Johnston, 1984; Ericsson & Simon, 1980, 1984), so studies of their usefulness 
for validating tests that go beyond mere recommendations and theoretical rationales are needed. 

This study examined the relevance of the data from verbal reports of thinking on test items for
validating multiple-choice tests that would be taken normally in paper-and-pencil format. A necessary 
condition for the data to be relevant is that the verbal reporting not alter examinees' thinking and 
performance from what it would have been had they taken the test in its paper-and-pencil format. The
satisfaction of this condition is often taken for granted, but this assumption is not warranted. There is
no firm evidence which shows whether or not asking examinees to report on their thinking while taking
tests affects the course of their thought. The purpose of the study was to gather such evidence. 

There is pertinent evidence on the effects of verbal reporting on the course of thought from outside 
testing contexts. For example, research on eyewitness testimony has shown that testimony given in 
response to non-leading questions tends to be more accurate than testimony given in response to 
leading questions (Clifford & Scott, 1978; Dale, Loftus, & Rathbun, 1978; Harris, 1973; Hilgard &
Loftus, 1979; Lipton, 1977; Loftus, 1979; Loftus & Palmer, 1974; Marquis, Marshall, & Oskamp, 1972).
This result is pertinent to the problem raised here to the extent that the mental processes used to 
report eyewitness testimony are the same as those used to report one's thinking on test items. Some of 
the processes are likely the same, since both activities involve memory retrieval. But not all the 
processes are likely the same: In the eyewitness testimony situation there is recall of an observation of 
an external event whereas, in the testing situation, there is recall of an internal thinking process; in the 
eyewitness testimony situation memory is probed about events in the more distant past whereas, in the 
testing situation, memory is probed about events in the very recent past. 

Evidence from research on information processing is also pertinent to determining the effect of verbal 
reporting on the course of thought. Ericsson and Simon (1980, 1984) have concluded that instructions
to verbalize thinking do not change the course of that thinking, but merely slow it down, when subjects 
are verbalizing information that would be available normally in short-term memory. However, they 
claim that specific and directive probes, especially requests for motivations and reasons, alter cognitive 
processing. These findings are particularly important because, if they generalize to the testing context, 
they cast doubt on recommendations to use such validation techniques as "analysis of reasons" 
(Messick, in press), which probe for examinees' reasons for answers. However, it is not known whether
or not they do generalize. 

This study addressed the following general question: Does the elicitation of verbal reports of thinking
on multiple-choice items requiring deliberative thought alter the course of examinees' thinking and
performance on those items from what it would have been had they answered the items in paper-and- 
pencil format without reporting verbally on their thinking? Only if the answer is negative can verbal 
reports of thinking on multiple-choice tests requiring deliberative thought be relevant to the validity of 
those tests in the context of their paper-and-pencil use. However, even if the answer is negative, this
does not automatically mean that verbal reports of thinking are useful for multiple-choice test 
validation. Maybe, for instance, the verbal reporting does not alter the course of examinees' thinking
and performance, but reveals so little about their thinking that it is worthless. The study did not 
directly address this issue, but nevertheless provided some information on it. 

The focus of the study was the validation of multiple-choice tests that require deliberative thought. I 
am not concerned here with tests that require rote recall, but rather ones that require deliberate 
reasoning to figure out the answers. This is a broad and somewhat vague category. It includes tests of 
higher order thinking within specific school subjects, tests of critical thinking, tests of inference in 
reading, and problem solving and decision making tests. I focussed on multiple-choice tests for three 
reasons: (a) they are widely used because they fit very well the pragmatic constraints of many testing
situations; (b) they are widely criticized as tests of deliberative thought (e.g., McPeck, 1981; Peine,
1986) on the grounds that they provide weak evidence on thinking processes; and (c) it is this very 
weakness (if it exists) in the evidence that multiple-choice tests provide on thinking processes that 
verbal reports of thinking can plausibly eliminate.



Method 

Sample 

Five senior high schools were chosen from communities on the east coast of Newfoundland, Canada. 
The communities ranged from single-industry fishing and industrial communities with less than 1,000 
people to a somewhat larger town of about 5,000, situated close to several similarly sized communities. 
This group of communities had a diverse economic base in fishing, government offices (including a 
police headquarters, a jail, and a court), tourism, light manufacturing, and shopping malls. The total 
sample consisted of 343 students, including all of the students in grades 10, 11, and 12 in four of the 
schools and about half of those in the other. This sample represented a broad range of student 
abilities. Although all the schools were in small communities, they were within commuting distance of 
the capital city and indeed many of the teachers commuted every day. Thus, the schools experienced 
little trouble in attracting highly qualified teachers. The students in these schools scored at or above
the national average on the Canadian Test of Basic Skills.

Instrumentation 



The task was supplied by Part A of the Test on Appraising Observations (Norris & King, 1983). The 
Test on Appraising Observations is a multiple-choice test of one aspect of critical thinking, the ability 
to judge the credibility of reports of observations. The test has been rated highly in a recent survey of 
tests for assessing higher order thinking (Arter & Salmon, 1987). Part A has 28 items written in the
context of a traffic accident at an intersection. In each item two people, either witnesses to the accident 
or individuals involved in it, report on what they observed happening. Examinees are to judge which, if 
either, of the reports is more credible. Relevant factors to consider in making judgments include the 
observer's expertise, alertness, and conflict of interest; the satisfactoriness of the observation 
conditions; and the source of the observation and the statement reporting it. 

Here is Item 1 as an example: 

A policeman is questioning Pierre and Martine. They were in their car at the 
intersection but were not involved in the accident. Martine is the driver and Pierre,
who had been trying to figure out which way to go, is the map reader. 

The policeman asks Martine how many cars were at the intersection when the 
accident occurred. She answers, "There were three."

Pierre says, "No, there were five cars."
Examinees are instructed to choose which, if either, of the two underlined statements they have more
reason to believe. The item is intended to test ability to recognize that the driver is likely to be more
alert to the road conditions than the map reader and, therefore, that Martine's report is more credible,
since all other factors appear equal.

Procedure 



A completely randomized factorial design was used to study four ways of eliciting verbal reports of 
subjects' thinking as they worked on the test. Students were selected one at a time according to the 
order of alphabetical class lists. They were assigned randomly to one of five groups, either to one of 
four elicitation groups or to a control group. The groups are described in Table 1. An associate and I
worked with students independently, each of us choosing the next available student on the list. 
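The assignment step can be sketched in code. The following is a minimal illustration only: it assumes simple randomization (an independent, equally likely draw for each student), which the text above does not specify, and the function name and the surnames in the roster are hypothetical.

```python
import random

# The five groups named in the Procedure section.
GROUPS = [
    "think aloud",
    "immediate recall",
    "criteria probe",
    "principle probe",
    "control",
]


def assign_groups(class_list, seed=None):
    """Take students in alphabetical order and draw one group for each.

    Assumption: each draw is independent and equally likely, so group
    sizes are only approximately equal.
    """
    rng = random.Random(seed)
    return [(student, rng.choice(GROUPS)) for student in sorted(class_list)]


# Hypothetical class list, used only to show the call.
roster = ["Walsh", "Abbott", "Power", "Noseworthy"]
assignments = assign_groups(roster, seed=1)
for student, group in assignments:
    print(f"{student}: {group}")
```

A blocked scheme that forces equal group sizes would be an equally plausible reading of "assigned randomly"; the sketch takes the simpler option.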

The verbal report elicitation procedures vary in the degree to which they lead examinees to provide 
particular sorts of information. The think aloud elicitation gives subjects the freedom to report as they 
see fit, and parallels the "free report" which yields the most accurate eyewitness testimony (Loftus, 
1979). Subsequent elicitations request particular types of information and are thus more directive of
the task to be carried out. The immediate recall elicitation requests reasons for answers selected, and
was thus used to test the efficacy of Messick's (in press) proposed "analysis of reasons" and Ericsson
and Simon's (1980, 1984) claim that requests for reasons alter the course of thinking. The criteria
probe and principle probe elicitations attempt to lead examinees by the questions asked, and thus were 
used to study the generalizability of the results from eyewitness testimony research on leading 
questions. In each group, subjects were told that they could go back to change their answers at any 
time. As an example, the elicitation procedures for Item 1 are described in Table 2.

[Insert Tables 1 and 2 about here.]

Tape-recorded verbal reports of thinking on items 1-15 were obtained from subjects in the elicitation 
groups. These subjects completed the remaining 13 items on Part A working privately in a paper-and- 
pencil format. Subjects in the control group worked privately in a paper-and-pencil format through all 
28 items on Part A. 



From the raw data, three sets of scores were derived. The concurrent performance score for each 
subject equalled the total number of items 1-15 answered correctly according to the key provided in the 
test manual (Norris & King, 1985). The subsequent performance score for each subject equalled the
total answered correctly for items 16-28. The scores were called "concurrent" and "subsequent" 
because, for the elicitation groups, items 1-15 were done concurrently with verbal reporting and items 
16-28 were done subsequently to it, working privately in a paper-and-pencil format. 

A thinking score was assigned for items 1-15 for all subjects in the elicitation groups. For each item, the 
quality of each subject's critical thinking displayed in his or her verbal report was rated on a scale of 0-3 
in accord with the procedure in Norris and King (1984) and these ratings totalled over the 15 items for 
each student. Thinking scores were assigned independently of the answers chosen.
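The three scores can be expressed compactly in code. This is a sketch under stated assumptions: the function names are hypothetical, the answer labels "A", "B", and "C" are placeholders rather than the labels used on the actual answer sheet, and the key shown is invented for the example and is not the key from the test manual.

```python
def performance_scores(answers, key):
    """Return (concurrent, subsequent) performance scores for one subject.

    answers, key: sequences of 28 answer labels, item 1 first.
    Concurrent = number correct on items 1-15.
    Subsequent = number correct on items 16-28.
    """
    if len(answers) != 28 or len(key) != 28:
        raise ValueError("Part A has 28 items")
    correct = [a == k for a, k in zip(answers, key)]
    return sum(correct[:15]), sum(correct[15:])


def thinking_score(ratings):
    """Total the 0-3 quality ratings given to items 1-15."""
    if len(ratings) != 15 or any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("expected 15 ratings, each from 0 to 3")
    return sum(ratings)


# Invented key and responses, for illustration only.
key = ["A"] * 28
answers = ["A"] * 10 + ["B"] * 5 + ["A"] * 13
concurrent, subsequent = performance_scores(answers, key)
total_thinking = thinking_score([1, 0, 2, 1, 0, 1, 1, 0, 0, 2, 1, 0, 0, 1, 0])
print(concurrent, subsequent, total_thinking)
```

Because the thinking score takes only the ratings as input, it is computed independently of the answers chosen, as the scoring procedure requires.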

Results 

There were two main results: (a) the elicitation of verbal reports of thinking did not alter subjects' 
performance and, by inference, did not alter their thinking; and (b) the different procedures for 
eliciting verbal reports yielded essentially the same information on the quality of subjects' thinking.

Verbal Reporting and Performance 

Two analyses support the conclusion that verbal reporting did not alter test performance. In the first, 
concurrent performance score was the dependent variable. This analysis determined whether giving 
verbal reports of thinking affected ongoing performance. In the second analysis, subsequent
performance score was the dependent variable. This analysis determined whether there was a carry-over
effect from verbal reporting, possibly as a result of learning different things through the verbal
reporting.

Two 5 x 3 x 2 x 2 fixed effects analyses of variance were performed with interview group, grade level,
interviewer, and sex as the independent variables. This allowed on average between 5 and 6
observations per cell using the total sample of 343 subjects. In both analyses, the four-way interaction 
mean square was combined with the error term. 

Table 3 contains mean concurrent and subsequent performance scores for each level of the four factors 
examined. All differences among means are small, being on the order of about 0.5. Neither analysis 
showed significant interaction effects. For concurrent performance, there was a significant main effect
for interviewer only (F(1, 290) = 335, p < .05). No significant differences in performance were found
among the elicitation levels. For subsequent performance, significant differences for interviewer
(F(1, 290) = 188, p < .05), sex (F(1, 290) = 7.19, p < .01), and grade level (F(2, 290) = 7.70, p < .01)
were found. Again, no significant differences were found among the elicitation levels.

[Insert Table 3 about here] 
Verbal Reporting and Quality of Thinking 

Two analyses were performed, one quantitative and one qualitative. In the quantitative analysis, thinking
score was taken as the dependent variable and elicitation group, grade level, interviewer, and sex as
independent variables in a 4 x 3 x 2 x 2 fixed effects analysis of variance. This analysis allowed on
average between 5 and 6 observations per cell given the 271 subjects in the four elicitation groups. The
control group was excluded from this analysis, since they had not given verbal reports of their thinking
and therefore could not be given thinking scores.
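The "between 5 and 6 observations per cell" figures follow from simple arithmetic on the two designs, as the short check below shows. The function names are hypothetical; the factor levels and sample sizes are those reported in the text.

```python
from math import prod


def cells_and_mean(levels, n_subjects):
    """Return the number of cells in a factorial design and the mean
    number of observations per cell."""
    cells = prod(levels)
    return cells, n_subjects / cells


def highest_order_interaction_df(levels):
    """Degrees of freedom of the interaction among all the factors."""
    return prod(level - 1 for level in levels)


# Performance analyses: 5 x 3 x 2 x 2 design, 343 subjects.
perf_cells, perf_mean = cells_and_mean((5, 3, 2, 2), 343)
# Thinking score analysis: 4 x 3 x 2 x 2 design, 271 subjects.
think_cells, think_mean = cells_and_mean((4, 3, 2, 2), 271)

print(f"{perf_cells} cells, {perf_mean:.2f} observations per cell")
print(f"{think_cells} cells, {think_mean:.2f} observations per cell")
# Four-way interaction pooled with error in the performance analyses.
print(highest_order_interaction_df((5, 3, 2, 2)), "df")
```

The two designs have 60 and 48 cells respectively, which gives roughly 5.7 and 5.6 subjects per cell.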

Table 4 gives mean thinking scores for each level of the four factors. Differences are on the order of 1 
point or less. On the 15 item section, subjects averaged less than 1 point per item out of a total 
possible of 3 points per item (see Footnote 1). No significant interaction or main effects were found.

[Insert Table 4 about here.] 

A qualitative analysis of the course of students' thinking was conducted on a random sample of 40 students,
10 from each elicitation group, drawn from the total sample of 271 students who gave verbal reports. Seven
categories of verbal moves were derived from the verbal reports of thinking: 

1. Citing Factual Details - either recalling a factual detail given in an item prior to the one currently 
being done, recalling such a prior detail incorrectly, or stating a detail in the current item;

2. Asking Rhetorical Questions - posing questions which appear to be directed to the subject himself 
or herself rather than to the interviewer; 

3. Making Evaluations - either evaluating judgments or conditions which had been explicitly stated
previously, or evaluating ones which had not been verbalized;

4. Constructing Supporting Assumptions - either making detailed factual assumptions specific to the 
current item, or making more generalized assumptions of broad principles of appraisal or causal
laws covering more than the situation in the current item; 

5. Using Attention Control Devices - either making comments about the stage of progress reached in 
reasoning through the problem, or commenting on the direction reasoning should proceed; 
6. Interacting with the Experimenter - directing comments or questions to the experimenter;

7. Pausing - either interjecting (Ahhh! Mmmm!), or being silent. 

The verbal reports were coded according to the seven categories and occurrences were accumulated 
across the 10 subjects for each category. No statistical analysis was performed. The data were taken as 
exploratory and examined for general trends with a view to more systematic exploration in the future. 
The question asked was whether elicitation group membership affected the course of thinking in ways
that were detectable by the above seven categories. The frequencies of each verbal move recorded in
Table 5 suggest little systematic difference among elicitation groups. While there are clear differences
among the verbal move categories, with some having occurred on the order of hundreds of times and
others on the order of tens of times, a striking feature is that the order of magnitude of the frequency
for each verbal move is the same for each elicitation group.
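One way to make the order-of-magnitude comparison concrete is to take, for each verbal move, the ratio of the largest to the smallest group frequency. The sketch below uses the frequencies reported in Table 5; the ratio criterion (largest count less than ten times the smallest) is an assumption introduced here for illustration and was not part of the original analysis, which involved no statistical test.

```python
# Frequencies from Table 5, in the order:
# think aloud, immediate recall, criteria probe, principle probe.
TABLE_5 = {
    "Citing Factual Details": (104, 139, 99, 139),
    "Asking Rhetorical Questions": (16, 9, 2, 5),
    "Making Evaluations": (45, 24, 39, 43),
    "Constructing Assumptions": (178, 228, 214, 227),
    "Controlling Attention": (26, 25, 13, 19),
    "Interacting with Experimenter": (19, 9, 12, 13),
    "Pausing": (499, 387, 424, 380),
}


def spread(counts):
    """Ratio of the largest to the smallest frequency across groups."""
    return max(counts) / min(counts)


for move, counts in TABLE_5.items():
    print(f"{move:30s} max/min = {spread(counts):.2f}")

# Assumed criterion: groups agree to within a factor of ten.
within_tenfold = all(spread(c) < 10 for c in TABLE_5.values())
print("All categories within a factor of ten:", within_tenfold)
```

By this criterion the widest spread occurs for rhetorical questions, the rarest category, where small counts make ratios unstable.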

[Insert Table 5 about here.] 

Discussion 

The results support the conclusion that verbal reports of thinking on multiple-choice test items can
provide relevant data on the validity of the tests taken in paper-and-pencil format. The conclusion has
long been supported on theoretical and intuitive grounds. But it was not known whether a necessary
condition for the relevance of verbal reports was satisfied, namely, that the reporting process not alter
the course of thinking and performance from what it would have been had the test been taken in
paper-and-pencil format. The results provide evidence that the condition is satisfied.

The analysis of verbal reporting and performance showed that test performance under a variety of 
elicitation procedures, from the nonleading request to think aloud to the leading questions about the
role of specific pieces of information, is the same as performance in a paper-and-pencil sitting with no
elicitation. The best explanation of this equivalent performance is that, on average, subjects in the
elicitation and control groups thought equivalently. If eliciting the verbal reports altered the course of
subjects' thinking, then this alteration should have been manifested in different performance scores
between the elicitation groups and the control group. While theoretically possible, it is hard to imagine
how subjects in the elicitation and control groups could have performed equivalently but thought
significantly differently. 

The analysis of verbal reporting and quality of thinking showed that there were no significant 
differences in the quality of thinking, as measured by thinking scores, across the four elicitation groups.
The qualitative analysis of verbal reports revealed that there was no essential difference in the patterns
of verbal moves used in reporting under different elicitation procedures. These results suggest strongly
that it is the task presented by the items and not how subjects' thinking is elicited that governs what
they report. Overall, the results support the use of verbal reports of thinking in validating
multiple-choice tests.



Furthermore, the results suggest that special care need not be taken to avoid leading questions when 
eliciting reports of thinking, because examinees were not led easily when reporting on their thinking.
Nevertheless, prudence may suggest a more cautious approach. Given the evidence on the effect of
leading questions in other domains and given that there was basically no difference in the information
obtained using any of the elicitation procedures, it may be more sensible to use the least directive (think
aloud) elicitation. A similar note of caution can be extended to Messick's (in press) proposal to
analyze subjects' reasons for their answer choices as a source of data on validity. Given that Ericsson
and Simon (1980, 1984) specifically caution that requests for reasons alter the course of thought and
given that such requests seem to deliver nothing beyond a request to think aloud, the latter approach
might be preferred.
Type II Error

Was this experiment sufficiently powerful to detect any true differences which existed among the 
groups? There are a number of reasons that make it highly plausible that differences would have been 
detected had they been present in the population. The first is the fact that the elicitation procedures
were considerably different from the control procedure. It is quite different for high school students to
work alone on a test in a way that normally occurs in school than to work in the presence of a stranger
who is probing their thinking in a way that hardly ever happens in school. Thus, if elicitations of verbal
reports of thinking have an effect on the course of performance, then it should have been revealed in
differences in performance between the elicitation and control groups.

A second reason for thinking that any true differences would have been detected is that the elicitation 
procedures were considerably different from each other, but produced no differential effects. The
leading probes were quite leading, because they made explicit suggestions to students about what could
have affected their choices of answers. It would have been easy for students to conform to these
suggestions. Instead, they regularly denied that the suggested factor had anything to do with their 
thinking and proceeded to explain how their choices were made. Students seemed to report what made 
sense to them and what was consistent with their own thinking. 

Any effect on performance of the leading criteria probe and principle probe elicitations would not
necessarily appear in the item being done. In these two treatments, students first chose their answers 
and then were asked the questions about whether specific pieces of information affected their choices. 
So, the elicitation could not have affected their original answer choice. However, students knew they
could change their answers at any time, but such changes were made rarely. Also, the elicitation for
one item could have affected performance on subsequent items. Students could have predicted on the 
basis of previous questions that they would be asked whether some specific piece of information in the 
item affected their choice. Consequently, they might have been more diligent in trying to focus on what 
was relevant. However, no effects of such a hypothesized increase in diligence were observed. This 
result is supported by the findings of Phillips (in press) which show that students did no better on a 
multiple-choice test of inference in reading, which necessarily makes the correct answers available, 
than they did on a constructed-response version of the same test.

Furthermore, in the think aloud and immediate recall elicitations, students knew before they started an
item what they would have to do, namely, report all they were thinking in the former treatment and
give reasons for their answer choice in the latter. Therefore, these treatments could have affected the
original answer choice on the item being done. But no differences between elicitation groups on either
performance or quality of thinking were found.

A third reason making the results of this experiment plausible is that effects were sought from a 
number of directions, but were found in none of them. Among the elicitation groups, there were no
differences either in the quality of students' thinking or in the patterns of verbal moves that typified
their verbal reports. Between the elicitation and control groups, there were no differences either for
performance concurrent with reporting or subsequent to it. It is plausible to think that if differences 
existed they would have been detected by at least one of these methods. 

In addition to the above considerations, an analysis of the statistical power of the experiment, 
performed using techniques described in Kirk (1968, pp. 9-11, and pp. 107-108), showed less than a 3%
chance of a Type II error overall. The analysis requires the calculation of a parameter and the use of charts
based upon a procedure by Tang (1938). The parameter is given by: 

$$\phi = \frac{\sqrt{\sum_{j=1}^{k} \beta_j^2 / k}}{\sigma_\epsilon / \sqrt{n}}$$

where

$\sum_{j=1}^{k} \beta_j^2$ = sum of squared treatment effects,

$k$ = number of treatment levels,

$n$ = size of the jth sample,

$\sigma_\epsilon^2$ = error variance.

In the calculation, $\frac{k-1}{n}(MS_{\text{treatment}} - MS_{\text{error}})$ was taken as an unbiased estimate of the sum of squared
treatment effects, and $MS_{\text{error}}$ as an unbiased estimate of the population error variance. With the
probability of a Type I error set at 0.05 for each analysis, the probability of Type II error was calculated
to be <1% for the analyses of verbal reporting and performance and <3% for the analysis of verbal
reporting and quality of thinking.
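The calculation of the parameter can be sketched as follows. The sketch assumes that the parameter is the noncentrality parameter used with Tang's charts, namely the square root of the mean squared treatment effect divided by the standard error of a group mean, and that the sum of squared treatment effects is estimated as described above. The function names are hypothetical, and the mean squares and sample size in the example are invented for illustration; they are not values from this study.

```python
from math import sqrt


def estimated_sum_sq_effects(k, n, ms_treatment, ms_error):
    """Estimate the sum of squared treatment effects as
    ((k - 1) / n) * (MS_treatment - MS_error), floored at zero."""
    return max((k - 1) / n * (ms_treatment - ms_error), 0.0)


def phi(k, n, ms_treatment, ms_error):
    """Assumed form of the parameter:
    sqrt(sum of squared effects / k) / (sigma_error / sqrt(n)),
    with MS_error standing in for the error variance."""
    effects = estimated_sum_sq_effects(k, n, ms_treatment, ms_error)
    return sqrt(effects / k) / (sqrt(ms_error) / sqrt(n))


# Invented inputs: 5 treatment levels, 68 subjects per level.
example_phi = phi(k=5, n=68, ms_treatment=12.0, ms_error=5.0)
print(f"phi = {example_phi:.3f}")
```

The resulting value, together with the degrees of freedom and the chosen Type I error rate, is what would be looked up in the power charts; the chart lookup itself is not reproduced here.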

Context-Specific Effects 

In the introduction, I limited the study to verbal reports of thinking on multiple-choice tests requiring 
deliberative thought. Verbal reports of thinking seem useful for validating such tests, because
examinees plausibly would have something to say about how they chose their answers. On a test of
rote recall or some other automatic process, subjects by definition are unlikely to have access to their 
thinking. So, collecting verbal reports of thinking does not make sense in this latter context. This 
intuition is supported by Bereiter and Bird (1985), who also believe that verbal reports of thinking 
would be most useful in activities requiring deliberative thought. Such activities would include the
critical thinking task used in this study and other critical thinking tasks, problem solving and decision 
making tasks, subject matter tasks requiring deliberation and reflection instead of rote recall, and tests 
of reading comprehension which require deliberative thinking such as some tests of inference and other 
higher order processes in reading. 

The need and desire to think deliberatively may help explain why different elicitation procedures did
not affect thinking in the situation studied in this experiment, but why eyewitness testimony research
consistently shows differential effects on the accuracy of verbal reports for different elicitation
procedures. Students thought deliberatively on the test because the task required it and, even though
the test did not count for school grades, the students wanted to portray themselves as capable people.
There is some evidence that subjects in eyewitness testimony experiments may not deliberate about
their task in this way. In a critical analysis of eyewitness testimony research, McCloskey and Egeth
(1983) contended that while laboratory research suggests that "jurors" place an unwarranted amount of
confidence in eyewitness testimony, studies of real jurors do not show this tendency. Real-life jurors 
tend to be skeptical of evidence and deliberative in their thinking in order to maintain the presumption 
of innocence. Maintaining a presumption of innocence is not crucial in psychological experiments. 

Implications 

Verbal reports of thinking would be useful in the validation of multiple-choice tests of deliberative 
thinking if they could provide evidence for judging whether good thinking was in general associated 
with choosing keyed answers and poor thinking with choosing unkeyed answers. This study focussed 
primarily on one necessary condition for this usefulness to exist, namely, that giving verbal reports does
not alter the course of thinking and performance. But even if, as the evidence suggests, they do not 
alter thinking or performance, they must contain enough information to allow comparisons to be made 
between the quality of examinees' thinking and their chosen answers. 
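The comparison between quality of thinking and chosen answers could, for a single item, be as simple as contrasting the mean thinking rating of examinees who chose the keyed answer with that of examinees who did not. The sketch below is illustrative only: the function name is hypothetical and the response data are invented, not drawn from this study.

```python
from statistics import mean


def mean_rating_by_answer(responses):
    """responses: (thinking_rating, chose_keyed_answer) pairs for one item.

    Returns (mean rating for keyed choices, mean rating for unkeyed choices).
    """
    keyed = [rating for rating, chose_keyed in responses if chose_keyed]
    unkeyed = [rating for rating, chose_keyed in responses if not chose_keyed]
    return mean(keyed), mean(unkeyed)


# Invented ratings on the 0-3 scale for six examinees on one item.
item_responses = [
    (2, True), (1, True), (2, True),
    (0, False), (1, False), (0, False),
]
keyed_mean, unkeyed_mean = mean_rating_by_answer(item_responses)
print(f"keyed: {keyed_mean:.2f}, unkeyed: {unkeyed_mean:.2f}")
```

An item on which the two means were close, or reversed, would be a candidate for closer inspection of its wording and key.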

In fact, the verbal reports of thinking contained a wealth of information useful for rating the quality of 
subjects' thinking and for diagnosing specific problems with items, such as the presence of misleading 
expressions, implicit clues, unfamiliar vocabulary, and alternative justifiable answers to the one keyed
correct (Norris, in press-a, in press-c). Given the results of this study, it is reasonable to trust this 
diagnostic information as an accurate representation of problems that would occur with the items taken 
in paper-and-pencil format.

Multiple-choice tests are popular largely because of their ease of administration and scoring. But the 
source of this popularity leads to criticisms of them. One criticism is that multiple-choice tests 
intended to examine deliberative thought and not mere rote recall provide no direct evidence of the 
reasoning examinees use to choose their answers. On account of this criticism, many educators believe 
that multiple-choice testing encourages an overemphasis on getting the right answers and undervalues 
careful reasoning. A systematic procedure for quantifying and using the data in verbal reports of 
thinking for developing and validating multiple-choice tests can overcome this criticism.
Multiple-choice tests could be developed for which the evidence from verbal reports of thinking indicates that, in
general, sound thinking is associated with choosing keyed answers and unsound thinking with choosing
unkeyed answers (Norris, in press*). Verbal reports of thinking thus offer the prospect of developing
multiple-choice tests which can serve both the desires for efficiency and cost-effectiveness and 
educational quality. 
Footnote

1. I have subsequently concluded that the 3-point thinking score scale was not suitable. To get 3
points, students had to generalize beyond the specific situation of the item by referring to a general
principle of critical thinking under which the specific case fell. Hardly any students did this and I now
believe that it is pedantic to expect it. Therefore, the effective thinking score range is 0-2 per item, or 
0-30 for the 15 items for which students gave verbal reports of their thinking. Thus, students averaged 
8.7 on the 30-point scale. 
Table 1

Description of Elicitation Levels

Elicitation Level               Description

Think Aloud Elicitation         Subjects were instructed to report all they were
                                thinking as they worked through an item and to
                                mark their answer on a standardized answer sheet.

Immediate Recall Elicitation    Subjects were asked to mark their answer to an
                                item on a standardized answer sheet and to tell why
                                they chose the answer they did.

Criteria Probe Elicitation      Subjects were asked to mark their answer on a
                                standardized answer sheet and then to tell whether
                                a piece of information pointed out in the item at
                                that time had made any difference to the answer
                                they chose.

Principle Probe Elicitation     Subjects were treated as in the criteria probe group
                                with an additional question asking whether their
                                choice of answer was based upon particular general
                                principles.

No Elicitation (Control)        Subjects were not interviewed, but were instructed
                                to work alone on the test and to mark their answers
                                on a standardized answer sheet.
Table 2

Verbal Report Elicitation Procedures for Item 1

Elicitation         Instructions to Subjects

Think Aloud         Try to tell me all that comes to your mind as you
                    think about this question.

Immediate Recall    Tell me which answer you choose and why you
                    choose that answer.

Criteria Probe      Which answer do you choose? Did the fact
                    that Pierre is the map reader affect your choice?

Principle Probe     Which answer do you choose? Did the fact
                    that Pierre is the map reader affect your choice?
                    .... If "No," go on to the next item. If "Yes," ask:
                    What difference did it make to your thinking that
                    he is the map reader?
Table 3

Mean Concurrent and Subsequent Performance for Elicitation Level, Interviewer,
Sex, and Grade Level

                                          Mean           Mean
                                          Concurrent     Subsequent
Factor         Level                      Performance    Performance

Elicitation    No Elicitation (Control)   7.8            8.4
               Think Aloud                8.0            8.4
               Immediate Recall           8.3            8.3
               Criteria Probe             7.9            8.6
               Principle Probe            7.6            8.1

Interviewer    A                          7.6            8.2
               B                          8.2            8.3

Sex            M                          7.7            8.0
               F                          8.0            8.7

Grade Level    10                         7.8            7.8
               11                         7.7            8.6
               12                         8.1            8.8
Table 4

Mean Thinking Scores for Elicitation Group, Interviewer, Sex, and Grade Level

Factor               Level               Mean Thinking Score

Elicitation Group    Think Aloud         7.9
                     Immediate Recall    9.2
                     Criteria Probe      8.8
                     Principle Probe     9.0

Interviewer          A                   8.1
                     B                   9.3

Sex                  M                   9.2
                     F                   8.3

Grade Level          10                  8.2
                     11                  8.6
                     12                  9.5
Table 5

Frequency of Verbal Moves by Elicitation Group

                                             Elicitation Group

                                 Think     Immed.    Crit.     Princ.
Verbal Moves                     Aloud     Recall    Probe     Probe

Citing Factual Details           104       139       99        139
Asking Rhetorical Questions      16        9         2         5
Making Evaluations               45        24        39        43
Constructing Assumptions         178       228       214       227
Controlling Attention            26        25        13        19
Interacting with Experimenter    19        9         12        13
Pausing                          499       387       424       380



